IRIX Base Documentation 2002 November

Text File | 2002-10-03 | 32.0 KB | 661 lines

PERFEX(1)                                                            PERFEX(1)

NAME
     perfex - Command line interface to processor event counters

SYNOPSIS
     perfex [-a | -e event0 [-e event1]] [-mp | -s | -p] [-pp [tid]] [-x]
            [-k] [-y] [-t] [-T] [-o file] [-c file] command

DESCRIPTION
     The given command is executed; after it is complete, perfex prints the
     values of various hardware performance counters.  The counts returned
     are aggregated over all processes that are descendants of the target
     command, as long as their parent process controls the child through
     wait (see wait(2)).

     The R10000 event counters are different from R12000 event counters.
     See the r10k_counters(5) man page for differences.

     For R10000 CPUs, the integers event0 and event1 index the following
     table:

           0 = Cycles
           1 = Issued instructions
           2 = Issued loads
           3 = Issued stores
           4 = Issued store conditionals
           5 = Failed store conditionals
           6 = Decoded branches.  (This changes meaning in 3.x versions of
               R10000.  It becomes resolved branches).
           7 = Quadwords written back from secondary cache
           8 = Correctable secondary cache data array ECC errors
           9 = Primary (L1) instruction cache misses
          10 = Secondary (L2) instruction cache misses
          11 = Instruction misprediction from secondary cache way
               prediction table
          12 = External interventions
          13 = External invalidations
          14 = Virtual coherency conditions.  (This changes meaning in 3.x
               versions of R10000.  It becomes ALU/FPU forward progress
               cycles.  On the R12000, this counter is always 0).
          15 = Graduated instructions
          16 = Cycles
          17 = Graduated instructions
          18 = Graduated loads
          19 = Graduated stores
          20 = Graduated store conditionals
          21 = Graduated floating point instructions
          22 = Quadwords written back from primary data cache
          23 = TLB misses
          24 = Mispredicted branches
          25 = Primary (L1) data cache misses
          26 = Secondary (L2) data cache misses
          27 = Data misprediction from secondary cache way prediction table
          28 = External intervention hits in secondary cache (L2)
          29 = External invalidation hits in secondary cache
          30 = Store/prefetch exclusive to clean block in secondary cache
          31 = Store/prefetch exclusive to shared block in secondary cache

     For R12000 CPUs, the integers event0 and event1 index the following
     table:

           0 = Cycles
           1 = Decoded instructions
           2 = Decoded loads
           3 = Decoded stores
           4 = Miss handling table occupancy
           5 = Failed store conditionals
           6 = Resolved conditional branches
           7 = Quadwords written back from secondary cache
           8 = Correctable secondary cache data array ECC errors
           9 = Primary (L1) instruction cache misses
          10 = Secondary (L2) instruction cache misses
          11 = Instruction misprediction from secondary cache way
               prediction table
          12 = External interventions
          13 = External invalidations
          14 = ALU/FPU progress cycles.  (This counter in current versions
               of R12000 is always 0).
          15 = Graduated instructions
          16 = Executed prefetch instructions
          17 = Prefetch primary data cache misses
          18 = Graduated loads
          19 = Graduated stores
          20 = Graduated store conditionals
          21 = Graduated floating-point instructions
          22 = Quadwords written back from primary data cache
          23 = TLB misses
          24 = Mispredicted branches
          25 = Primary data cache misses
          26 = Secondary data cache misses
          27 = Data misprediction from secondary cache way prediction table
          28 = State of intervention hits in secondary cache (L2)
          29 = State of invalidation hits in secondary cache
          30 = Store/prefetch exclusive to clean block in secondary cache
          31 = Store/prefetch exclusive to shared block in secondary cache

BASIC OPTIONS
     -e event
          Specify an event to be counted.

          Zero, one, or two event specifiers may be given; by default,
          cycles are counted.  Events may also be specified by setting one
          or both of the environment variables T5_EVENT0 and T5_EVENT1.
          Command line event specifiers, if present, override the
          environment variables.  The order in which events are specified
          is not important.  The counts, together with an event
          description, are written to stderr unless redirected with the -o
          option.  Two events that must be counted on the same hardware
          counter (see r10k_counters(5)) cause a conflicting counters
          error.

     -a   Multiplexes over all events, projecting totals.  Ignores event
          specifiers.

          The option -a produces counts for all events by multiplexing over
          16 events per counter.  The OS does the switching round robin at
          clock interrupt boundaries.  The resulting counts are normalized
          by multiplying by 16 to give an estimate of the values they would
          have had for exclusive counting.
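
          The normalization described above can be illustrated with a toy
          model.  The Python sketch below is not the IRIX implementation: it
          assumes one multiplexing slot per clock tick and an arbitrary slot
          number for the event, and the per-tick counts are made up.  It
          only shows how scaling a 1-in-16 sample by 16 behaves for a steady
          event and for an event concentrated in a single tick.

```python
# Toy model of -a multiplexing (an assumption, not the IRIX implementation):
# time is divided into clock ticks, each counter cycles through its 16 events
# one tick at a time, and the raw count is scaled by 16 afterwards.

EVENTS_PER_COUNTER = 16

def projected_count(per_tick_counts, slot):
    """Count the event only in ticks where its slot is active, then scale."""
    raw = sum(n for tick, n in enumerate(per_tick_counts)
              if tick % EVENTS_PER_COUNTER == slot)
    return raw * EVENTS_PER_COUNTER

# An event spread evenly over 64 ticks: projection matches the true total.
steady = [100] * 64
print(sum(steady), projected_count(steady, slot=3))

# The same total concentrated in one tick: missed when the slot is inactive,
# overestimated when the slot happens to be active during the burst.
burst = [0] * 64
burst[10] = 6400
print(sum(burst), projected_count(burst, slot=3),
      projected_count(burst, slot=10))
```
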
          Due to the equal-time nature of the multiplexing, events present
          in large enough numbers to contribute significantly to the
          execution time will be fairly represented.  Events concentrated
          in a few short regions (for instance, instruction cache misses)
          may not be projected very accurately.

     -mp  Report per-thread counts for multiprocessing programs as well as
          (default) totals.

          By default, perfex aggregates the counts of all the child threads
          and reports this number for each selected event.  The -mp option
          causes the counters for each thread to be collected at thread
          exit time and printed out; the counts aggregated across all
          threads are printed next.  The per-thread counts are labeled by
          process ID (pid).

     -pp  Report per-pthread counts for multiprocessing programs.

          perfex -pp tid displays the counts of the pthread with thread id
          tid.  perfex -mp -pp displays the counts of all pthreads
          associated with the process.  If pthread 0 is chosen, all of the
          pthread counts will be displayed.  The -pp option causes the
          counters for the thread to be collected at thread exit time and
          printed out.  The per-pthread counts are labeled by thread ID
          (tid).

     -o file
          Redirects perfex output to the specified file.  In the -mp case,
          the file name includes the pid of the sproc child thread.

     -s   Starts (or stops) counting when a SIGUSR1 (or SIGUSR2) signal is
          received by a perfex process.

     -p period
          Profiles (samples) the counters with the given period.
          With -s, perfex waits until it (that is, the perfex process
          itself) receives a SIGUSR1 before it starts counting for the
          child process (the target).  It stops counting if it receives a
          SIGUSR2.  Repeated cycles of this will aggregate counts.  If no
          SIGUSR2 is received (the usual case), counting continues until
          the child exits.  Note that counting for descendants of the child
          is not affected, so counting for mp programs cannot be controlled
          with this option.

     -x   Counts at exception level (as well as the default user level).
          Exception level includes time spent on behalf of the user during,
          for example, TLB refill exceptions.  Other counting modes
          (kernel, supervisor) are available through the OS ioctl interface
          (see r10k_counters(5)).

     -k   Counts at kernel level (as well as user and exception level, if
          set).  The program must be run with superuser privileges.

EXAMPLE
     To collect instruction and data secondary cache miss counts for a
     program normally executed by

          % bar < bar.in > bar.out

     use

          % perfex -e 26 -e 10 bar < bar.in > bar.out

COST ESTIMATE OPTIONS
     -y   Report statistics and ranges of estimated times per event.

          Without the -y option, perfex reports the counts recorded by the
          event counters for the events requested.  Since they are simply
          raw counts, it is difficult to know by inspection which events
          are responsible for significant portions of the job's run time.
          The -y option associates time cost with some of the event counts.

          The reported times are approximate.  Due to the superscalar
          nature of the R10000 and R12000 CPUs, and their ability to hide
          latency, stating a precise cost for a single occurrence of many
          of the events is not possible.
          Cache misses, for example, can be overlapped with other
          operations, so there is a wide range of times possible for any
          cache miss.

          To account for the fact that the cost of many events cannot be
          known precisely, perfex -y reports a range of time costs for each
          event.  "Maximum," "minimum," and "typical" time costs are
          reported.  Each is obtained by consulting an internal table that
          holds the maximum, minimum, and typical costs for each event, and
          multiplying this cost by the count for the event.  Event costs
          are usually measured in terms of machine cycles, and so the cost
          of an event generally depends on the clock speed of the
          processor, which is also reported in the output.

          The maximum value contained in the table corresponds to the worst
          case cost of a single occurrence of the event.  Sometimes this
          can be a very pessimistic estimate.  For example, the maximum
          cost for graduated floating-point instructions assumes that all
          such instructions are double precision reciprocal square roots,
          since that is the most costly floating-point instruction.

          Due to the latency-hiding capabilities of the CPUs, the minimum
          cost of virtually any event could be zero, since most events can
          be overlapped with other operations.  To avoid simply reporting
          minimum costs of 0, which would be of no practical use, the
          minimum time reported by perfex -y corresponds to the "best case"
          cost of a single occurrence of the event.  The best case cost is
          obtained by running the maximum number of simultaneous
          occurrences of that event and averaging the cost.  For example,
          two floating-point instructions can complete per cycle, so the
          best case cost on the R10000 is 0.5 cycles per floating-point
          instruction.

          The typical cost falls somewhere between minimum and maximum and
          is meant to correspond to the cost one would expect to see in
          average programs.
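
          The range computation described above (per-occurrence cost
          multiplied by the event count, converted to time with the clock
          speed) can be sketched in Python.  This is only an illustration:
          the function name, the event count, the clock speed, and the
          typical and maximum costs are hypothetical placeholders and are
          not taken from the perfex cost table.  Only the 0.5-cycle best
          case for floating-point instructions comes from the text.

```python
# Sketch of the range computation: each estimate is
# (cost per occurrence in cycles) x (event count) x (seconds per cycle).

def estimate_times(count, costs_clks, clock_mhz):
    """Return (minimum, typical, maximum) estimated seconds for one event."""
    seconds_per_cycle = 1.0 / (clock_mhz * 1e6)
    return tuple(cost * count * seconds_per_cycle for cost in costs_clks)

# Hypothetical (min, typical, max) costs in cycles per occurrence.  Only the
# 0.5-cycle minimum is stated in the text; the other two are placeholders.
fp_costs = (0.5, 1.0, 52.0)

fp_count = 195_000_000   # hypothetical count of graduated FP instructions
clock_mhz = 195.0        # hypothetical processor clock speed

minimum, typical, maximum = estimate_times(fp_count, fp_costs, clock_mhz)
print(f"min {minimum:.3f}s  typical {typical:.3f}s  max {maximum:.3f}s")
```

          Costs given in nanoseconds rather than cycles would skip the
          clock-speed conversion and be multiplied by the count directly.
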
          To measure the typical cost of a cache miss, for example,
          stride-1 accesses to an array too big to fit in cache were timed,
          and the number of cache misses generated was counted.  The same
          number of stride-1 accesses to an in-cache array were then timed.
          The difference in times corresponds to the cost of the cache
          misses, and this was used to calculate the average cost of a
          cache miss.  This typical cost is lower than the worst case, in
          which each cache miss cannot be overlapped, and it is higher than
          the best case, in which several independent, and hence
          overlapping, cache misses are generated.  (Note that on Origin
          systems, this methodology yields the time for secondary cache
          misses to local memory only.)  Naturally, these typical costs are
          somewhat arbitrary.  If they do not seem right for the
          application being measured by perfex, they can be replaced by
          user-supplied values.  See the -c option below.

          perfex -y prints the event counts and associated cost estimates
          sorted from most costly to least costly.  While resembling a
          profiling output, it is not a true profile.  The event costs
          reported are only estimates.  Furthermore, since events do
          overlap with each other, the sum of the estimated times will
          usually exceed the program's run time.  This output should only
          be used to identify which events are responsible for significant
          portions of the program's run time and to get a rough idea of
          what those costs might be.

          With this in mind, the built-in cost table does not make an
          attempt to provide detailed costs for all events.  Some events
          provide summary or redundant information.  These events are
          assigned minimum and typical costs of 0, so that they sort to the
          bottom of the output.  The maximum costs are set to 1 cycle, so
          that you can get an indication of the time corresponding to these
          events.
          Issued instructions and graduated instructions are examples of
          such events.  In addition to these summary or redundant events,
          detailed cost information has not been provided for a few other
          events, such as external interventions and external
          invalidations, since it is difficult to assign costs to these
          asynchronous events.  The built-in cost values may be overridden
          by user-supplied values using the -c option.

          In addition to the event counts and cost estimates, perfex -y
          also reports a number of statistics derived from the typical
          costs.  The meaning of many of the statistics is self-evident
          (for example, graduated instructions/cycle).  The following are
          statistics whose definitions require more explanation.  These are
          available with both R10000 and R12000 CPUs.

          Data mispredict/Data secondary cache hits
               This is the ratio of the counts for data misprediction from
               secondary cache way prediction table and secondary data
               cache misses.

          Instruction mispredict/Instruction secondary cache hits
               This is the ratio of the counts for instruction
               misprediction from secondary cache way prediction table and
               secondary instruction cache misses.

          Primary cache line reuse
               This is the number of times, on average, that a primary data
               cache line is used after it has been moved into the cache.
               It is calculated as graduated loads plus graduated stores
               minus primary data cache misses, all divided by primary data
               cache misses.

          Secondary Cache Line Reuse
               This is the number of times, on average, that a secondary
               data cache line is used after it has been moved into the
               cache.  It is calculated as primary data cache misses minus
               secondary data cache misses, all divided by secondary data
               cache misses.

          Primary Data Cache Hit Rate
               This is the fraction of data accesses that are satisfied
               from a cache line already resident in the primary data
               cache.
               It is calculated as 1.0 - (primary data cache misses divided
               by the sum of graduated loads and graduated stores).

          Secondary Data Cache Hit Rate
               This is the fraction of data accesses that are satisfied
               from a cache line already resident in the secondary data
               cache.  It is calculated as 1.0 - (secondary data cache
               misses divided by primary data cache misses).

          Time accessing memory/Total time
               This is the sum of the typical costs of graduated loads,
               graduated stores, primary data cache misses, secondary data
               cache misses, and TLB misses, divided by the total program
               run time.  The total program run time is calculated by
               multiplying cycles by the time per cycle (the inverse of the
               processor's clock speed).

          Primary-to-secondary bandwidth used (MB/s, average per process)
               This is the amount of data moved between the primary and
               secondary data caches, divided by the total program run
               time.  The amount of data moved is calculated as the sum of
               the number of primary data cache misses multiplied by the
               primary cache line size and the number of quadwords written
               back from primary data cache multiplied by the size of a
               quadword (16 bytes).  For multiprocess programs, the
               resulting figure is a per-process average, since the counts
               measured by perfex are aggregates of the counts for all the
               threads.  You must multiply by the number of threads to get
               the total program bandwidth.

          Memory bandwidth used (MB/s, average per process)
               This is the amount of data moved between the secondary data
               cache and main memory, divided by the total program run
               time.  The amount of data moved is calculated as the sum of
               the number of secondary data cache misses multiplied by the
               secondary cache line size and the number of quadwords
               written back from secondary data cache multiplied by the
               size of a quadword (16 bytes).
               For multiprocess programs, the resulting figure is a
               per-process average, since the counts measured by perfex are
               aggregates of the counts for all the threads.  You must
               multiply by the number of threads to get the total program
               bandwidth.

          MFLOPS (average per process)
               This is the ratio of the graduated floating-point
               instructions and the total program run time.  Note that
               while a multiply-add carries out two floating-point
               operations, it only counts as one instruction, so this
               statistic may underestimate the number of floating-point
               operations per second.  For multiprocess programs, the
               resulting figure is a per-process average, since the counts
               measured by perfex are aggregates of the counts for all the
               threads.  You must multiply by the number of threads to get
               the total program rate.

          The following statistics are computed only on R12000 CPUs:

          Cache misses in flight per cycle (average)
               This is the count of event 4 (Miss Handling Table (MHT)
               population) divided by cycles.  It can range between 0 and 5
               and represents the average number of cache misses of any
               kind that are outstanding per cycle.

          Prefetch miss rate
               This is the count of event 17 (prefetch primary data cache
               misses) divided by the count of event 16 (executed prefetch
               instructions).  A high prefetch miss rate (about 1) is
               desirable, since prefetch hits are wasting instruction
               bandwidth.

          A statistic is only printed if counts for the events which define
          it have been gathered.

     -c file
          Load a cost table from file (requires that -y is specified).

          This option allows you to override the internal event costs used
          by the -y option.  file contains the list of event costs that are
          to be overridden.  This file must be in the same format as the
          output produced by the -t option.
          Costs may be specified in units of "clks" (machine cycles) or
          "nsec" (nanoseconds).  You can override all or only a subset of
          the default costs.

          You can also use the file /etc/perfex.costs to override event
          costs.  If this file exists, any costs listed in it will override
          those built into perfex.  Costs supplied with the -c option will
          override those provided by the /etc/perfex.costs file.

     -t   Print the cost table used for perfex -y cost estimates to stdout.

          These internal costs can be overridden by specifying different
          values in the file /etc/perfex.costs or by using the -c file
          option.  Both file and /etc/perfex.costs must use the format as
          provided by the -t option.  It is recommended that you capture
          this output to a file and edit it to create a suitable file for
          /etc/perfex.costs or the -c option.  You do not have to specify
          costs for every event, however.  Lines corresponding to events
          with values you do not wish to override may simply be deleted
          from the file.

MIXED CPU OPTION
     The following is an option for systems with both R10000 and R12000
     CPUs.

     -T   Allows experienced users to use perfex on a system of mixed CPUs.
          Although perfex cannot verify it, the specification of this
          option means that you have used either dplace(1) or some other
          means to ensure that the program is using either all R10000 CPUs
          or all R12000 CPUs.
          When used with this option, the -y option will not produce cost
          estimates due to the fact that the cost estimation cannot know
          which type of CPU is actually targeted.  Nothing prevents you,
          however, from loading a cost table with -c.  This cost table
          could be directly dumped from a pure-R10000 or pure-R12000
          system, depending on which CPU flavor the program is running.

CHANGE IN BEHAVIOR OF DEFAULT EVENTS
     Because of limitations of ABI/API compliance with Irix version
     6.5/R10000 in the operating system counter interface, it is only
     possible to count cycles and graduated instructions on counter 0.
     Accordingly, when the R12000 user specifies an event in the range 0-15
     to perfex, either through a -e argument or environment variables,
     cycles cannot be counted simultaneously with that event as they can on
     the R10000.  (perfex only multiplexes events for the -a option, never
     for individually specified events).  In these cases perfex will count
     event 16 (executed prefetch instructions) as the second event.

     For similar reasons, perfex no longer remaps events 0, 15, 16, and 17
     to fit them on two (R10000) counters, since that would induce a
     different behavior for identical arguments on R10000 and R12000
     systems.  It would create problems when mixed-CPU systems are
     supported.
     To be specific, prior to 6.5.3 a user could specify:

          % perfex -e 0 -e 15 a.out

     This would execute as if the user had specified:

          % perfex -e 0 -e 17 a.out

     or

          % perfex -e 15 -e 16 a.out

     After Irix version 6.5.3, this argument combination is an error, and
     the user must decide which of the equivalent (for R10000 only) forms
     to use.  It is the lack of equivalence for R12000 that makes this
     regression necessary.

FILES
     /etc/perfex.costs

DEPENDENCIES
     perfex only works on an R10000 or R12000 system.  Programs running on
     mixed R10000 and R12000 CPUs are not supported, although specifying
     the -T option will permit you to verify that only CPUs of the same
     type are being used.  Usually, perfex prints an informative message
     and fails on mixed CPU systems.

     For the -mp option, only binaries linked-shared are currently
     supported; this is due to a dependency on libperfex.so.  The options
     -s and -mp are currently mutually exclusive.

LIMITATIONS
     The signal control interface (-s) can control only the immediate
     target process, not any of its descendants.  This makes it unusable
     with multiprocess targets in their parallel regions.

SEE ALSO
     r10k_counters(5), libperfex(3C), time(1), timex(1)
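
     Several of the derived statistics defined under the -y option can be
     reproduced from raw counts.  The Python sketch below follows those
     definitions, using event numbers that are the same in the R10000 and
     R12000 tables.  It is not perfex's own code: the function name, the
     sample counts, the clock speed, and the cache line sizes are
     hypothetical inputs, and one MB is assumed to be 10^6 bytes.

```python
# Derived statistics following the definitions given for -y.
# Event numbers come from the event tables; all counts are made-up inputs.

QUADWORD_BYTES = 16

def derived_stats(c, cycles, clock_mhz, l1_line_bytes, l2_line_bytes):
    """c maps event number -> count; returns a dict of statistics."""
    run_time = cycles / (clock_mhz * 1e6)     # total run time in seconds
    accesses = c[18] + c[19]                  # graduated loads + stores
    l1_miss, l2_miss = c[25], c[26]           # primary/secondary data misses
    return {
        "l1_line_reuse": (accesses - l1_miss) / l1_miss,
        "l2_line_reuse": (l1_miss - l2_miss) / l2_miss,
        "l1_hit_rate": 1.0 - l1_miss / accesses,
        "l2_hit_rate": 1.0 - l2_miss / l1_miss,
        # Event 22: quadwords written back from primary data cache.
        "l1_l2_mb_per_s": (l1_miss * l1_line_bytes
                           + c[22] * QUADWORD_BYTES) / run_time / 1e6,
        # Event 7: quadwords written back from secondary cache.
        "memory_mb_per_s": (l2_miss * l2_line_bytes
                            + c[7] * QUADWORD_BYTES) / run_time / 1e6,
        # Event 21: graduated floating-point instructions.
        "mflops": c[21] / run_time / 1e6,
    }

# Hypothetical counts for a 10-second run at 195 MHz.
counts = {7: 2_000_000, 18: 800_000_000, 19: 200_000_000,
          21: 100_000_000, 22: 10_000_000,
          25: 50_000_000, 26: 5_000_000}

stats = derived_stats(counts, cycles=1_950_000_000, clock_mhz=195.0,
                      l1_line_bytes=32, l2_line_bytes=128)
for name, value in stats.items():
    print(f"{name:16s} {value:10.3f}")
```

     For a multiprocess run, the bandwidth and MFLOPS figures computed this
     way are per-process averages and must be multiplied by the number of
     threads to get program totals.
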